Data Frame Selection and Indexing

We've seen how to call built-in data frames and how to create them by passing vectors to data.frame(). Let's revisit our weather data frame and learn how to select elements from it using bracket notation:

In [1]:
# Some made up weather data
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Pass in the vectors:
df <- data.frame(days,temp,rain)
In [2]:
df
Out[2]:
  days  temp  rain
1 mon   22.2  TRUE
2 tue   21    TRUE
3 wed   23    FALSE
4 thu   24.3  FALSE
5 fri   25    TRUE

We can use the same bracket notation we used for matrices:

df[rows,columns]

In [4]:
# Everything from first row
df[1,]
Out[4]:
  days  temp  rain
1 mon   22.2  TRUE
In [5]:
# Everything from first column
df[,1]
Out[5]:
  1. mon
  2. tue
  3. wed
  4. thu
  5. fri
In [6]:
# Grab Friday data
df[5,]
Out[6]:
  days  temp  rain
5 fri   25    TRUE
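
Row and column positions can also be combined in a single call to pick out one cell or a block of cells. The following is a minimal sketch that rebuilds the same weather data so it runs on its own; the names tue.temp, first.two and no.monday are only illustrative, and the negative index (which drops a row) is an addition not shown in the cells above:

```r
# Rebuild the weather data frame so this block runs on its own
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days,temp,rain)

# Single cell: row 2, column 2 (Tuesday's temperature)
tue.temp <- df[2,2]

# Block of cells: first two rows, second and third columns
first.two <- df[1:2,2:3]

# A negative row index drops that row: everything except Monday
no.monday <- df[-1,]
```

Selecting a single cell gives back a plain value, while selecting a block with more than one column gives back a smaller data frame.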

Selecting using column names

This is where data frames become very powerful: we can select columns by name instead of having to remember their positions. For example:

In [8]:
# All rain values
df[,'rain']
Out[8]:
  1. TRUE
  2. TRUE
  3. FALSE
  4. FALSE
  5. TRUE
In [11]:
# First 5 rows for days and temps
df[1:5,c('days','temp')]
Out[11]:
  days  temp
1 mon   22.2
2 tue   21
3 wed   23
4 thu   24.3
5 fri   25
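
Column names can be mixed freely with row positions. As a small sketch (the data is rebuilt so the block is self-contained, and mid.week and wed.temp are illustrative names), names() lists the column names available for selection:

```r
# Rebuild the weather data frame so this block runs on its own
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days,temp,rain)

# Column names that can be used inside the brackets
col.names <- names(df)

# Rows 2 to 4, but only the days and temp columns
mid.week <- df[2:4,c('days','temp')]

# One cell, addressed by row position and column name
wed.temp <- df[3,'temp']
```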

If you want all the values of a particular column, you can use the dollar sign directly after the data frame name, as follows:

df.name$column.name

In [12]:
df$rain
Out[12]:
  1. TRUE
  2. TRUE
  3. FALSE
  4. FALSE
  5. TRUE
In [15]:
df$days
Out[15]:
  1. mon
  2. tue
  3. wed
  4. thu
  5. fri

You can also use single-bracket notation with just a column name to get the same information back as a one-column data frame instead of a vector:

In [14]:
df['rain']
Out[14]:
  rain
1 TRUE
2 TRUE
3 FALSE
4 FALSE
5 TRUE
In [18]:
df['days']
Out[18]:
  days
1 mon
2 tue
3 wed
4 thu
5 fri
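
The difference between these selection styles is the type of object you get back. A short sketch that checks this, assuming the same weather data (rebuilt here so the block runs on its own; the variable names are illustrative, and drop = FALSE is a standard argument of bracket indexing that was not used above):

```r
# Rebuild the weather data frame so this block runs on its own
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days,temp,rain)

# Dollar sign and df[,'rain'] both return a plain logical vector
rain.vec <- df$rain
rain.vec2 <- df[,'rain']

# Single brackets with only a name return a one-column data frame
rain.df <- df['rain']

# drop = FALSE keeps the data frame format in the rows,columns form
rain.keep <- df[,'rain',drop = FALSE]

is.logical(rain.vec)      # TRUE
is.data.frame(rain.df)    # TRUE
is.data.frame(rain.keep)  # TRUE
```

Which form to use depends on what the next step expects: a vector for calculations such as mean(), or a data frame when the column should keep its name and row labels.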

Filtering with a subset condition

We can use the subset() function to grab a subset of rows from our data frame based on some condition. For example, imagine we wanted to grab the days where it rained (rain == TRUE). We can use the subset() function as follows:

In [19]:
subset(df,subset=rain==TRUE)
Out[19]:
  days  temp  rain
1 mon   22.2  TRUE
2 tue   21    TRUE
5 fri   25    TRUE

Notice that the condition uses a comparison operator, in this case ==. Let's grab the days where the temperature was greater than 23:

In [20]:
subset(df,subset=temp>23)
Out[20]:
  days  temp  rain
4 thu   24.3  FALSE
5 fri   25    TRUE

Another thing to note is that we didn't pass in the column name as a character string; subset() knows that you are referring to a column in that data frame.
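
Conditions can be combined with &, and subset() also accepts a select argument to keep only some columns. The sketch below goes slightly beyond the cells above (select and the bracket-based equivalent are additions, and the variable names are illustrative), using the same weather data rebuilt so it runs on its own:

```r
# Rebuild the weather data frame so this block runs on its own
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days,temp,rain)

# Two conditions at once: warmer than 22 and no rain (wed and thu)
warm.dry <- subset(df,subset=temp>22 & rain==FALSE)

# Filter rows and keep only the days and temp columns
hot.days <- subset(df,subset=temp>23,select=c(days,temp))

# The same row filter written with bracket notation and a logical vector
same.rows <- df[df$temp>23,]
```

With bracket notation the column has to be written out as df$temp, since the shortcut of using bare column names only applies inside subset().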

Ordering a Data Frame

We can sort a data frame using the order() function. You pass the column you want to sort by into order(), which returns a vector of row indices, and then you use that vector to select rows from the data frame. Let's see an example of sorting by temperature:

In [28]:
sorted.temp <- order(df['temp'])
In [29]:
df[sorted.temp,]
Out[29]:
  days  temp  rain
2 tue   21    TRUE
1 mon   22.2  TRUE
3 wed   23    FALSE
4 thu   24.3  FALSE
5 fri   25    TRUE

Let's take a look at what sorted.temp actually is:

In [30]:
sorted.temp
Out[30]:
  1. 2
  2. 1
  3. 3
  4. 4
  5. 5
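
To make the role of order() explicit, here is a small sketch on the bare temperature vector (idx, by.index and by.sort are illustrative names). It shows that order() returns positions rather than values, and that indexing with those positions gives the same result as sort():

```r
temp <- c(22.2,21,23,24.3,25)

# Positions of the values from smallest to largest: 2 1 3 4 5
idx <- order(temp)

# Indexing with those positions rearranges the values
by.index <- temp[idx]

# sort() returns the sorted values directly
by.sort <- sort(temp)

identical(by.index, by.sort)  # TRUE
```

For a data frame we need the positions rather than the sorted values, because the same positions are used to rearrange every column at once.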

So we are just asking for the rows in that index order. The default is ascending; we can put a negative sign in front of the column to get descending order:

In [31]:
desc.temp <- order(-df['temp'])
In [32]:
df[desc.temp,]
Out[32]:
  days  temp  rain
5 fri   25    TRUE
4 thu   24.3  FALSE
3 wed   23    FALSE
1 mon   22.2  TRUE
2 tue   21    TRUE

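We could have also used the other column selection methods we learned, such as the dollar sign. Passing a plain vector like df$temp is also the safer choice, since recent versions of R may emit a warning when order() is given a data frame such as df['temp']: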

In [34]:
sort.temp <- order(df$temp)
df[sort.temp,]
Out[34]:
  days  temp  rain
2 tue   21    TRUE
1 mon   22.2  TRUE
3 wed   23    FALSE
4 thu   24.3  FALSE
5 fri   25    TRUE
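
order() has a few more options that are worth knowing about. The sketch below is an addition to the examples above (the variable names are illustrative): it uses the decreasing argument as an alternative to the negative sign, and passes two columns to sort by one and break ties with the other:

```r
# Rebuild the weather data frame so this block runs on its own
days <- c('mon','tue','wed','thu','fri')
temp <- c(22.2,21,23,24.3,25)
rain <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
df <- data.frame(days,temp,rain)

# Descending by temperature without the negative sign
by.temp.desc <- df[order(df$temp,decreasing = TRUE),]

# Sort by rain first (FALSE comes before TRUE), then by temp within each group
by.rain.temp <- df[order(df$rain,df$temp),]
```

The decreasing argument also works for columns where a negative sign would not make sense, such as character strings.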

That's it for data frames for now! We will definitely revisit and explore data frames a lot more, but we should test your understanding first. Up next: an exercise!